NSF PAR Search | NSF Public Access Repository

Open-World Evaluation for Retrieving Diverse Perspectives

https://doi.org/10.18653/v1/2025.naacl-long.431

Chen, Hung-Ting; Choi, Eunsol (April 2025, Association for Computational Linguistics)

We study retrieving a set of documents that covers various perspectives on a complex and contentious question (e.g., will ChatGPT do more harm than good?). We curate a Benchmark for Retrieval Diversity for Subjective questions (BERDS), where each example consists of a question and diverse perspectives associated with the question, sourced from survey questions and debate websites. On this data, retrievers paired with a corpus are evaluated to surface a document set that contains diverse perspectives. Our framing diverges from most retrieval tasks in that document relevancy cannot be decided by simple string matches to references. Instead, we build a language model-based automatic evaluator that decides whether each retrieved document contains a perspective. This allows us to evaluate the performance of three different types of corpus (Wikipedia, web snapshot, and corpus constructed on the fly with retrieved pages from the search engine) paired with retrievers. Retrieving diverse documents remains challenging, with the outputs from existing retrievers covering all perspectives on only 33.74% of the examples. We further study the impact of query expansion and diversity-focused reranking approaches and analyze retriever sycophancy. Together, we lay the foundation for future studies in retrieval diversity handling complex queries.
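The headline figure in this abstract (all perspectives covered on only 33.74% of examples) is a per-example coverage rate. A minimal sketch of that kind of computation follows. The function names, the data layout, and the `substring_judge` placeholder are assumptions for illustration and are not taken from the paper. The abstract states that relevance cannot be decided by simple string matching and that a language model-based evaluator is used, so the placeholder only stands in for that evaluator.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical types: a judge returns True when `document` expresses `perspective`.
Judge = Callable[[str, str], bool]
Example = Tuple[Sequence[str], Sequence[str]]  # (perspectives, retrieved documents)


def covers_all(perspectives: Sequence[str], documents: Sequence[str], judge: Judge) -> bool:
    """True if every perspective is expressed by at least one retrieved document."""
    return all(any(judge(doc, p) for doc in documents) for p in perspectives)


def full_coverage_rate(examples: Sequence[Example], judge: Judge) -> float:
    """Fraction of examples whose retrieved set covers all of its perspectives."""
    if not examples:
        return 0.0
    covered = sum(1 for perspectives, docs in examples if covers_all(perspectives, docs, judge))
    return covered / len(examples)


def substring_judge(document: str, perspective: str) -> bool:
    # Placeholder only (assumption): a real evaluator would be a language model,
    # because string matching cannot decide whether a perspective is present.
    return perspective.lower() in document.lower()


# Toy data: the first example is fully covered, the second misses "risk".
examples = [
    (["benefit", "risk"], ["A clear benefit.", "A serious risk."]),
    (["benefit", "risk"], ["A clear benefit."]),
]
rate = full_coverage_rate(examples, substring_judge)
```

Passing the judge in as a parameter keeps the metric independent of how perspective detection is implemented.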
Free, publicly-accessible full text available April 1, 2026

Textless Speech-to-Speech Translation With Limited Parallel Data

https://doi.org/10.18653/v1/2024.findings-emnlp.951

Diwan, Anuj; Srinivasan, Anirudh; Harwath, David; Choi, Eunsol (November 2024, Association for Computational Linguistics)

Full Text Available

AmbigDocs: Reasoning across Documents on Different Entities under the Same Name

Lee, Yoonsang; Ye, Xi; Choi, Eunsol (October 2024, Conference on Language Modeling)

Different entities with the same name can be difficult to distinguish. Handling confusing entity mentions is a crucial skill for language models (LMs). For example, given the question “Where was Michael Jordan educated?” and a set of documents discussing different people named Michael Jordan, can LMs distinguish entity mentions to generate a cohesive answer to the question? To test this ability, we introduce a new benchmark, AmbigDocs. By leveraging Wikipedia’s disambiguation pages, we identify a set of documents, belonging to different entities who share an ambiguous name. From these documents, we generate questions containing an ambiguous name and their corresponding sets of answers. Our analysis reveals that current state-of-the-art models often yield ambiguous answers or incorrectly merge information belonging to different entities. We establish an ontology categorizing four types of incomplete answers and automatic evaluation metrics to identify such categories. We lay the foundation for future work on reasoning across multiple documents with ambiguous entities.
Full Text Available

Long-Form Answers to Visual Questions from Blind and Low Vision People

Huh, Mina; Xu, Fangyuan; Peng, Yi-Hao; Chen, Congyan; Murugu, Hansika; Gurari, Danna; Choi, Eunsol; Pavel, Amy (October 2024, Conference on Language Modeling)

Vision language models can now generate long-form answers to questions about images -- long-form visual question answers (LFVQA). We contribute VizWiz-LF, a dataset of long-form answers to visual questions posed by blind and low vision (BLV) users. VizWiz-LF contains 4.2k long-form answers to 600 visual questions, collected from human expert describers and six VQA models. We develop and annotate functional roles of sentences of LFVQA and demonstrate that long-form answers contain information beyond the question answer such as explanations and suggestions. We further conduct automatic and human evaluations with BLV and sighted people to evaluate long-form answers. BLV people perceive both human-written and generated long-form answers to be plausible, but generated answers often hallucinate incorrect visual details, especially for unanswerable visual questions (e.g., blurry or irrelevant images). To reduce hallucinations, we evaluate the ability of VQA models to abstain from answering unanswerable questions across multiple prompting strategies.
Full Text Available

Complex Claim Verification with Evidence Retrieved in the Wild

Chen, Jifan; Kim, Grace; Sriram, Aniruddh; Durrett, Greg; Choi, Eunsol (June 2024, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers))

Retrieving evidence to support or refute claims is a core part of automatic fact-checking. Prior work makes simplifying assumptions in retrieval that depart from real-world use cases: either no access to evidence, access to evidence curated by a human fact-checker, or access to evidence published after a claim was made. In this work, we present the first realistic pipeline to check real-world claims by retrieving raw evidence from the web. We restrict our retriever to only search documents available prior to the claim’s making, modeling the realistic scenario of emerging claims. Our pipeline includes five components: claim decomposition, raw document retrieval, fine-grained evidence retrieval, claim-focused summarization, and veracity judgment. We conduct experiments on complex political claims in the ClaimDecomp dataset and show that the aggregated evidence produced by our pipeline improves veracity judgments. Human evaluation finds the evidence summary produced by our system is reliable (it does not hallucinate information) and relevant to answering key questions about a claim, suggesting that it can assist fact-checkers even when it does not reflect a complete evidence set.
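The temporal restriction described in this abstract (searching only documents available before the claim was made) can be sketched as a simple date filter. The `Document` class, its fields, and the `available_before` helper are hypothetical names introduced for illustration. The abstract does not say how publication dates are obtained or compared, so this is a sketch under those assumptions and not the authors' implementation.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Sequence


@dataclass(frozen=True)
class Document:
    # Hypothetical record; the field names are assumptions.
    url: str
    published: date
    text: str


def available_before(documents: Sequence[Document], claim_date: date) -> List[Document]:
    """Keep only documents published strictly before the claim was made."""
    return [doc for doc in documents if doc.published < claim_date]


# Toy data: only the first document predates the claim.
claim_date = date(2022, 3, 1)
candidates = [
    Document("https://example.org/a", date(2022, 1, 15), "Earlier report."),
    Document("https://example.org/b", date(2022, 3, 1), "Same-day article."),
    Document("https://example.org/c", date(2022, 6, 30), "Later fact-check."),
]
usable = available_before(candidates, claim_date)
```

The strict comparison excludes same-day documents. Whether those should count as available is a design choice the abstract does not settle.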
Full Text Available

Aligning Data with the Goals of an Organization and Its Workers: Designing Data Labeling for Social Service Case Notes

https://doi.org/10.1145/3613904.3642014

Gondimalla, Apoorva; Sreekanth, Varshinee; Joshi, Govind; Nelson, Whitney; Choi, Eunsol; Slota, Stephen C; Greenberg, Sherri R; Fleischmann, Kenneth R; Lee, Min Kyung (May 2024, ACM)

Full Text Available

Propagating Knowledge Updates to LMs Through Distillation

Padmanabhan, Shankar; Onoe, Yasumasa; Zhang, Michael JQ; Durrett, Greg; Choi, Eunsol (December 2023, Advances in Neural Information Processing Systems)

Modern language models have the capacity to store and use immense amounts of knowledge about real-world entities, but it remains unclear how to update such knowledge stored in model parameters. While prior methods for updating knowledge in LMs successfully inject atomic facts, updated LMs fail to make inferences based on injected facts. In this work, we demonstrate that a context distillation-based approach can both impart knowledge about entities and propagate that knowledge to enable broader inferences. Our approach consists of two stages: transfer set generation and distillation on the transfer set. We first generate a transfer set by prompting a language model to generate continuations from the entity definition. Then, we update the model parameters so that the distribution of the LM (the student) matches the distribution of the LM conditioned on the definition (the teacher) on the transfer set. Our experiments demonstrate that this approach is more effective at propagating knowledge updates than fine-tuning and other gradient-based knowledge-editing methods. Moreover, it does not compromise performance in other contexts, even when injecting the definitions of up to 150 entities at once.
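The distillation step in this abstract asks the student's distribution to match the teacher's on the transfer set. One common way to measure such a mismatch is the KL divergence between next-token distributions. The abstract does not name the objective, so the choice of KL divergence, its direction (teacher relative to student), and the function name below are assumptions made for illustration.

```python
import math
from typing import Sequence


def kl_divergence(teacher: Sequence[float], student: Sequence[float], eps: float = 1e-12) -> float:
    """KL(teacher || student) for two discrete distributions over the same vocabulary.

    Terms where the teacher probability is zero contribute nothing; student
    probabilities are floored at `eps` to avoid division by zero.
    """
    if len(teacher) != len(student):
        raise ValueError("distributions must have the same length")
    return sum(t * math.log(t / max(s, eps)) for t, s in zip(teacher, student) if t > 0)


# Toy next-token distributions over a two-token vocabulary (made-up numbers).
teacher_probs = [0.5, 0.5]   # model conditioned on the entity definition
student_probs = [0.9, 0.1]   # same model without the definition in context
mismatch = kl_divergence(teacher_probs, student_probs)
```

In a training loop this quantity would be averaged over the positions of the transfer set and minimized with respect to the student's parameters. The divergence is zero when the two distributions are identical.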
Full Text Available

A Critical Evaluation of Evaluations for Long-form Question Answering

https://doi.org/10.18653/v1/2023.acl-long.181

Xu, Fangyuan; Song, Yixiao; Iyyer, Mohit; Choi, Eunsol (January 2023, Association for Computational Linguistics)

Full Text Available

Continually Improving Extractive QA via Human Feedback

https://doi.org/10.18653/v1/2023.emnlp-main.27

Gao, Ge; Chen, Hung-Ting; Artzi, Yoav; Choi, Eunsol (January 2023, Conference on Empirical Methods in Natural Language Processing)

Can LMs Learn New Entities from Descriptions? Challenges in Propagating Injected Knowledge

Onoe, Yasumasa; Zhang, Michael J.Q.; Padmanabhan, Shankar; Durrett, Greg; Choi, Eunsol (January 2023, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers))

Full Text Available
